SEMplMe: a tool for integrating DNA methylation effects in transcription factor binding affinity predictions

Motivation Aberrant DNA methylation in transcription factor binding sites has been shown to lead to anomalous gene regulation that is strongly associated with human disease. However, the majority of methylation-sensitive positions within transcription factor binding sites remain unknown. Here we introduce SEMplMe, a computational tool to generate predictions of the effect of methylation on transcription factor binding strength in every position within a transcription factor’s motif. Results SEMplMe uses ChIP-seq and whole genome bisulfite sequencing to predict effects of methylation within binding sites. SEMplMe validates known methylation sensitive and insensitive positions within a binding motif, identifies cell type specific transcription factor binding driven by methylation, and outperforms SELEX-based predictions for CTCF. These predictions can be used to identify aberrant sites of DNA methylation contributing to human disease. Availability and Implementation SEMplMe is available from https://github.com/Boyle-Lab/SEMplMe. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04865-x.

on the position within the motif [3,4]. Recent work has shown that the strength of the effect of methylation on transcription factor binding affinity varies between nucleotides within a single transcription factor motif [3,4]. It is vital to determine the specific functional impact of methylation within transcription factor binding sites as aberrant methylation is a hallmark of many human diseases, including cancer, schizophrenia, and autism spectrum disorders [5][6][7]. Methods that can better predict these effects on gene transcription can assist in identifying and prioritizing potentially harmful variations.
The effect of methylation on the binding of individual proteins has been studied in vitro using protein binding microarrays (PBMs) and newer systematic enrichment of ligands by exponential enrichment (SELEX) based methods [8][9][10][11]. Both PBMs and SELEX rely on proteins binding to DNA fragments in vitro and may not recapitulate endogenous binding patterns within the genome. A recent study of methylation sensitivity in 542 human transcription factors using a high throughput SELEX method, methyl-SELEX, found 23% of transcription factors were sensitive to methylation, 34% were enhanced by methylation, and 40% were insensitive to methylation [3]. Computational methods to analyze methyl-SELEX data, such as Methyl-Spec-Seq, provide quantitative information on the magnitude and direction of the predicted effect of methylation on transcription factor binding [12]. Additional SELEX-based studies have also observed differences in methylation sensitivity between different positions in a single motif, and is supported by evidence that some bases within transcription factor binding motifs are more correlated with disease compared to others [13,14]. However, predictions produced from these methods are limited as they rely on in vitro SELEX data and may not reflect binding patterns in a native context. Methods to determine the methylation sensitivity of transcription factors in vivo exist, however they are experimentally rigorous, or do not directly estimate methylation consequence on transcription factor binding, and are therefore challenging to use for broad interpretation [4,15,16]. A robust method to study the native context of DNA methylation within transcription factor binding sites using in vivo data is still needed to more accurately model the role these epigenetic marks play on transcriptional regulation.
To address this need, we have adapted SEMpl, a computational genome-wide transcription factor binding affinity prediction method, to incorporate whole genome bisulfite-seq (WGBS) data. This allows our predictions to include the effects of DNA methylation on binding affinity [17]. SEMpl uses open-source in vivo data to generate predictions using transcription factor binding data from ChIP-seq and open chromatin data from DNase-seq for a transcription factor of interest. The results are displayed as a SNP effect matrix providing predictions for every potential base change in a transcription factor's motif. Our SNP Effect Matrix pipeline with Methylation (SEMplMe) method expands these results by incorporating methylation data from WGBS, generating predictions that encompass the magnitude and direction of change to transcription factor binding for all 4 nucleotide base pairs, and adds two additional nucleotide letters: methylated C (M), and G opposite to a methylated C (W). This new tool provides improved specificity to determine which variants lead to disruption of transcription factor binding by integrating endogenous functional information on methylation states and transcription factor binding, advancing our ability to interrogate and prioritize mutations likely to be associated with human disease.

Usage/accessibility
SEMplMe is open source and can be downloaded from https:// github. com/ Boyle-Lab/ SEMpl Me. Precomputed SEMplMe plots are available for more than 70 transcription factors (Additional file 2: Table S1).

SNP effect matrix pipeline with methylation
SEMplMe functions as an extension of our previously published expectation-maximization method, SEMpl [17]. SEMpl uses an estimation-maximization-like algorithm to predict the consequence variation to binding in transcription factor binding sites using three data types: chromatin immunoprecipitation followed by deep sequencing (ChIPseq) data, endogenous measures of transcription factor binding genome-wide, DNase I hypersensitive site (DNase-seq) data, a measure of open chromatin genome-wide where transcription factors are known to function; and position weight matrices (PWMs), representing previous knowledge of the binding motif for a transcription factor. SEMpl uses the kmers generated from a given PWM to generate SNP kmer lists by simulating all possible in silico variants along each kmer. These kmers are then aligned to the genome in regions of open chromatin so that their analogous ChIP-seq scores can be averaged for each kmer locus with a shared nucleotide and motif position (i.e. a C in position 3 of the motif ). SEMpl used these averaged values to estimate the consequence of all possible variants in a binding site and outputs a matrix of predictions for each of the four nucleotides at each position of a transcription factor's motif. By including whole genome bisulfite sequencing (WGBS) data to the final output of SEMpl, we have expanded the interpretation of our algorithm to include the contribution of DNA methylation on transcription factor binding ( Fig. 1). Starting PWM and cell type used to generate this SEM and all other data shown in figures in this paper can be found in Additional file 2: Table S2.
In order to evaluate the consequence of DNA methylation in transcription factor binding sites we first gathered WGBS for each kmer aligned to the genome containing an in silico SNP. All data shown was generated using matched cell types for ChIP-seq, DNaseseq, and WGBS data. As the vast majority of sites in WGBS data methylation are not binary, the contribution of the proportion of methylation on binding for C and G SNPs at each position within a motif is calculated. Methylation is calculated for each aligned   n represents the total number of kmers in the list. Therefore, the equation: n k=1 (1 − (M k * S k ))/n represents the signal contribution of the non-methylated kmer.
Using this method, cytosines are divided into methylated and non-methylated components for each position within the motif of a transcription factor. Following this, all 6 nucleotides are included in a SNP effect matrix at each position along the motif of the transcription factor and plotted for an easy to visualize model of transcription factor binding (Fig. 1C). SEMplMe is written in C++ , perl and R. In addition to a the matrix file (.me.sem) and the pdf of the visualized sem (*_semplot.me.pdf ), the output also includes a matrix of standard error (.sterr) and a matrix of total ChIP-seq signal (.me.totals). New alignment and baseline files are also generated for SEMplMe (.me). A quality control file was used, which provides the -log10(P-value) of the average of 100 t-tests from 1000 randomly chosen kmers from the signal files versus the scrambled signal files from SEMpl (Additional file 2: Table S1). A threshold of 3.15 was set to report confidence in a SEM plot, with runs falling under this threshold highlighted in red.

SEMplMe sequence scoring
Scoring a full sequence with SEMplMe can be done in the same manner as PWMs or SEMpl, where the log2 score analogous to the nucleotide of interest at each position is added to reflect the predicted binding score of the sequence. This allows predictions to be made for motifs carrying more than one variant.

EMSA
Kd values for CEBPB and ATF4 were calculated from a previously published EMSA reaction by densiometric scanning by ImageJ and the Excel Solver Package [9,19,20]. All EMSA scores are represented as a ratio to the unmethylated control.

Correlation with ChIP-seq and illumina 450 k data
All kmers likely to bind CTCF were recovered from the final iteration of SEMpl. For each kmer with at least 50 occurrences, the average ChIP-seq signal and standard error were calculated. Correlation cutoffs for SEMplMe were defined as the scrambled baseline for the final iteration of SEMpl.
Illumina 450 k microarray M-values are previously published and publically available [21]. Microarray probes were mapped to the genome and transcription factor binding to each probe loci was determined using the JASPER CORE 2022 predicted transcription factor binding site database [22]. 61 bp sequences encompassing the predicted CTCF binding sites were scored using CTCF specific matrices from SEMplMe, SEMpl, and Methyl-Spec-seq.
Correlation comparisons were completed using the cocor R package using one tailedtests and a 0.95 CI [23].

SEMplMe provides quantitative predictions based on in vivo measures of binding affinity
SEMplMe integrates endogenous functional data encompassing transcription factor binding, open chromatin, and DNA methylation to provide a quantitative prediction of the effect of methylation on transcription factor binding affinity at every position within a binding motif. By including measures of DNA methylation, SEMplMe is able to calculate the relative average transcription factor binding affinity of methylated genomic sequence by using a weighted sum of ChIP-seq signal and the proportion of methylation at the site from WGBS (Fig. 1). Averaging this signal genome-wide for methylated and unmethylated sequence separately allows for the generation of a quantitative prediction matrix of the effect methylation has on transcription factor binding affinity (Fig. 1C). SEMplMe represents an advancement over currently existing methods as its predictions are generated from in vivo functional data, it is generally accessible without additional experimental work, and the resulting matrix is both quantitative for a single position and across an entire motif.

SEMplMe recapitulates differences in methylation sensitivity between transcription factors
Transcription factor differences in methylation sensitivity were examined by calculating the absolute difference between methylated and unmethylated bases at each position within SEMplMe matrices for previously classified methylation sensitive and insensitive transcription factors. Methylation sensitive transcription factors examined here include CREB, cMYC, USF, NFkB, E2F, MYC, and ZFX [2,11,24,25]. Methylation insensitive transcription factors examined here include SP1, REST, CEBPa, FOXA1, RXRA, and ARNT2 [2,8,11,[25][26][27]. As expected, transcription factors previously associated with methylation sensitivity show a larger average difference in SEM scores between C and M, and G and W nucleotides compared to transcription factors previously defined as insensitive (Fig. 2). This pattern is driven by methylation sensitivity across an entire Altogether, this suggests that prior definitions of methylation sensitivity and insensitivity may reflect general trends of transcription factor methylation sensitivity.

DNA methylation drives cell type specific transcription factor binding
DNA methylation is hypothesized to contribute to cell type specific transcription factor binding by altering the availability of DNA sequence. In support of this, the ChIPseq and WGBS cell type used in SEMplMe analysis was found to influence the output plot for a transcription factor known to have cell-specific activity. JUN, a transcription factor known to be differentially regulated in HepG2 cells, shows high correlation of SEMplMe outputs for methylated sites (MW) between H1-hESC and K562 cell lines (R 2 = 0.91), and a reduced correlation to HepG2 (R 2 = 0.43) (correlation comparison: HepG2 v. H1-hESC p-value < = 0.01, HepG2 v. K562 p-value < = 0.01, H1-hESC v. K562 p-value < = 0.07) (Fig. 3) [28]. This is supported by MethMotif data, in which JUN shows many more methylated binding sites, most of which fall into a mid-to highly-methylated state in HepG2, as opposed to comparatively few overlapping methylated sites in K562 and H1-hESC [15]. This pattern of reduced correlation was not observed when looking across the entire SEMplMe output, suggesting methylated sites are driving this difference (Additional file 1: Figure S1). Of note, this pattern is not seen for another transcription factor, CEBPB, where the SEMplMe output for methylated sites is highly correlated between all cell types examined (K562, IMR-90, HepG2, and GM12878), suggesting that not all transcription factors are subject to cell type specificity due to methylation differences (Additional file 1: Figure S2 Interestingly, SEMpl data without methylation appears to be primarily cell type agnostic, providing evidence that methylation plays a meaningful role in cell type specificity for only some transcription factors [17].

SEMplMe validation using in vitro measures of transcription factor binding affinity
To evaluate SEMplMe on a metric external to ChIP-seq data, our predictions were compared to previously published PBM data, which has been used by previous studies to examine the occupancy of individual transcription factors to potential target sequence in vitro [8,9]. SEMplMe predictions were compared to microarray Z-scores data from PBMs, which represent transcription factor binding affinity to an individual oligo which is either methylated or unmethylated. The best-bound 8-mers for each of the 8 transcription factors with both PBM Z-scores and SEMplMe scores were assessed. A modest level of agreement was observed between SEMplMe predictions and PBM data for 16 DNA sequences across 8 transcription factors (Fig. 4A) (R 2 = 0.67) (CEBPA, CEBPB, CEBPD, CREB1, ATF4, JUN, JUND, CEBPG) [9]. This agreement is reduced when using SEMpl scores without methylation (R 2 = 0.56), suggesting that the inclusion of methylation in our model improves scores for methylated sequences (Fig. 4B) (correlation comparison: p-value = 0.33). Discrepancies between SEM predictions and PBM data may be attributed to differences in in vivo versus in vitro methods of generation.
To further functionally validate SEMplMe, data from in vitro electrophoretic mobility shift assays (EMSAs) were utilized to examine our predictions. EMSA data can be used to analyze the binding of transcription factors in the presence of variation in a quantitative manner by comparing dissociation constants between oligos of interest [Zuo,  [10]. B. SEMpl shows a reduced correlation with PBM data compared to SEMplMe (R 2 = 0.56), suggesting the addition of methylation data improves methylated sequence predictions. C. SEMplMe predictions correlate with previously published electrophoretic mobility shift assay (EMSA) data for methylated, hemi-methylated, and unmethylated binding sites for ATF4 and CEBPB (R 2 = 0.65) [9]. Grey ribbons represent Pearson's 0.95 CI 2017]. Previously published EMSA data was evaluated for two transcription factors, ATF4(CREB) and CEBPB. This measure of in vitro binding showed marginal agreement with our predictions (R 2 = 0.65) (Fig. 4C) [9]. This observed low agreement is driven entirely by CEBPB which has relatively low correlation with our predictions (R 2 = 0.17), as opposed to ATF4 (R 2 = 0.92). CEBPB has been reported to preferentially bind to methylated sequence, thus the discrepancy in predictions has previously been thought to be a result of limited genome methylated sequence availability, a necessity for calculating more accurate predictions in SEMplMe [9]. SEMplMe identified comparatively few methylated sites throughout the genome, leading to a much higher standard deviation for the effect of methylated sites (Additional file 1: Figure S3). This unavailability of methylated sites is consistent with previous data showing methylated CEBPB motifs to bind well in vitro, but poorly in vivo [29].

SEMplMe predictions are consistent with previous findings for CTCF
CTCF is a well studied transcription factor previously shown to be methylation sensitive [30,31]. CTCF binding predictions using SEMplMe found the majority of positions to be methylation sensitive for both M and W. Notably, a handful of sites had methylated sequence scores at or slightly above their unmethylated counterparts, and likely represent methylation insensitive positions. These results are consistent with CTCF's role as a methylation sensitive transcription factor. As CTCF is widely used in research studies, its binding to sites containing methylated positions within its motif have been previously surveyed by a variety of methodologies, including qualitative EMSA, observation  [4,12,32] of binding following demethylation of cells, and SELEX-based methods [4,12,30,32]. When SEMplMe results were compared to measures of binding at individual positions within the CTCF motif, a general agreement was observed for the direction of binding for all positions predicted to decrease binding affinity (Fig. 5). Though the majority of sites identified by previous studies within the CTCF motif were found to be overwhelmingly methylation sensitive, two sites were predicted to lead to increased binding affinity when methylated. Though SEMplMe did not identify these positions, one site overlaps a SEMplMe position consistent with methylation insensitivity, and the other was found to not significantly increase binding by all prior studies [4]. Overall, our predictions are consistent with previous studies of CTCF binding direction.
Correlation between the entirety of the CTCF matrices generated by SEMplMe and the recently published Methyl-Spec-seq method, which uses in vitro SELEX to predict methylation effects on transcription factor binding affinity, was assayed (R 2 = 0.56) (Additional file 1: Figure S4) [12]. SEMplMe outperformed Methyl-Spec-seq by performance comparison when comparing scores across entire kmers to their average ChIPseq signal (SEMplMe R 2 = 0.25, Methyl-Spec-seq R 2 = 0.04, correlation comparison p-value < = 0.01) ( Fig. 6A and B). The kmer set used is associated with active CTCF binding and includes both methylated and unmethylated sequences. Additionally, we compared SEMplMe SEM scores, SEMpl SEM scores without methylation, and Methyl-Spec-seq ePWM scores to in vitro metrics of CTCF protein binding by analyzing all motifs corresponding to CTCF, including to 6 methylated and 6 unmethylated oligos, from a previously published Illumina 450 k microarray. In doing so we found SEM-plMe predictions to have a more robust inverse correlation with methylation occupancy (SEMplMe R 2 = 0.176, SEMpl R 2 = 0.04, MethylSpec-seq R 2 = 0.096, correlation comparison p-values > 0.132) (Fig. 6C-E) [21]. This is expected, as CTCF is predicted to be methylation sensitive along the majority of its motif. Interestingly, though no statistical differences in correlation comparisons were noted with this small sample size, both tools implementing methylation data outperformed SEMpl without methylation, supporting the importance of including measures of methylation in determining transcription factor binding disruption in the presence of methylation. Altogether, this provides further evidence that predictions of change to transcription factor binding affinity perform better when generated from in vivo data, rather than in vitro data such as from SELEX methods.

Discussion
SEMplMe is poised to advance our understanding of the effects of methylation on transcription factor binding affinity through its generation of quantitative predictions using in vivo functional data. SEMplMe will both improve our ability to predict putative disease loci affected by aberrant DNA methylation and increase predictions of transcription factor binding affinity in general [25]. This is expected to hold true regardless of whether reduced methylation in a transcription factor's motif contributes to its binding or is caused by its binding [33]. The nucleotide W was included to capture not just position dependent, but stand dependent methylation, as strand specificity due to hemimethylation has previously been found to influence transcription factor binding [12]. This is likely driven by changes in DNA structure. SEMplMe has similar limitations to its predecessor SEMpl, such as a dependence on available ChIP-seq, DNase-seq, and WGBS data. It is further restricted by the limited number of methylated sites in the genome available for use in generating models of binding. In instances where few sequence specific sites also contain methylation, our measure of standard deviation increases considerably. Though the low confidence in these sites can be visualized by error bars, predictions of methylation at these loci are limited. Cell type should be carefully considered before running SEMplMe for optimal predictions as cell type specificity contributes to the final SEMplMe plot, and methylation sensitivity has been previously found to be paralog specific [13]. Additional cell type specific factors such as transcription factor co-binding may also affect the final SEMplMe plot as seen for MYC in SEMpl, though is expected to be rare due to the genome-wide nature of the predictions [17]. Starting PWMs should also be carefully considered for transcription factors known to recognize different methylated and unmethylated motifs [8].
The inclusion of CpG methylation provides additional information to help fully understand context-specific transcription factor binding. However, the addition of more nuanced molecular mechanisms that contribute to transcription factor binding are likely to further improve our predictions. This includes additional types of DNA methylation, such as hydroxymethylation and nonCpG methylation, as well as measures of structural changes to the genome [10,13,[34][35][36].

Conclusions
DNA methylation is a key epigenetic mark known to act in a regulatory capacity, allowing transcription factors to bind in a cell-type specific manner. Counter to the idea that all methylation is able to disrupt transcription factor binding, recent studies have revealed that certain methylated loci impact binding more than others. Predicting the locations of these methylation sensitive loci and quantifying the effect of methylation on transcription factor binding affinity is challenging. Here we introduce an expansion to our previously released software SEMpl, called SEMplMe, which integrates predictions of the effect of cytosine methylation on transcription factor binding affinity based on WGBS data. These predictions agree with in vitro data of transcription factor binding, are cell-type specific, and show a general agreement with data from transcription factors previously annotated as methylation sensitive and insensitive.
The improved predictions provided by SEMplMe will contribute to a better understanding of the key positions within transcription factor binding sites affected by DNA methylation. These predictions are accessible and can be generated without undertaking methylation specific binding assays. This advancement is central to improving our ability to prioritize mutations associated with aberrant methylation contributing to human disease.

CpG
Cytosine-phosphate-guanine PBMs Protein binding microarrays SELEX Systematic enrichment of ligands by exponential enrichment WGBS Whole genome bisulfite-seq SEMplMe SNP effect matrix pipeline with methylation ChIP-seq Chromatin immunoprecipitation followed by deep sequencing DNase-seq DNase I hypersensitive site sequencing EMSA Electrophoretic mobility shift assay